Terrorism
903ceb0ed2d5ceec6e2c9b317b6c54a8-Paper-Conference.pdf
Recent advances in Large Vision-Language Models (LVLMs) have showcased strong reasoning abilities across multiple modalities, achieving significant breakthroughs in various real-world applications. Despite this great success, the safety guardrails of LVLMs may not cover the unforeseen domains introduced by the visual modality. Existing studies primarily focus on eliciting LVLMs to generate harmful responses via carefully crafted image-based jailbreaks designed to bypass alignment defenses. In this study, we reveal that a safe image can be exploited to achieve the same jailbreak consequence when combined with additional safe images and prompts. This stems from two fundamental properties of LVLMs: universal reasoning capabilities and the safety snowball effect. Building on these insights, we propose Safety Snowball Agent (SSA), a novel agent-based framework leveraging agents' autonomous and tool-using abilities to jailbreak LVLMs. SSA operates through two principal stages: (1) initial response generation, where tools generate or retrieve jailbreak images based on potential harmful intents, and (2) harmful snowballing, where refined subsequent prompts induce progressively harmful outputs. Our experiments demonstrate that SSA can use nearly any image to induce LVLMs to produce unsafe content, achieving high jailbreak success rates against the latest LVLMs. Unlike prior works that exploit alignment flaws, SSA leverages the inherent properties of LVLMs, presenting a profound challenge for enforcing safety in generative multimodal systems.
On the Robustness of Transformers against Context Hijacking for Linear Classification
Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. However, their predictions can be disrupted by factually correct context, a phenomenon known as context hijacking, revealing a significant robustness issue. To understand this phenomenon theoretically, we explore an in-context linear classification problem based on recent advances in linear transformers. In our setup, context tokens are designed as factually correct query-answer pairs, where the queries are similar to the final query but have opposite labels. Then, we develop a general theoretical analysis on the robustness of the linear transformers, which is formulated as a function of the model depth, training context lengths, and number of hijacking context tokens. A key finding is that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations. We show that this improvement arises because deeper layers enable more fine-grained optimization steps, effectively mitigating interference from context hijacking. This is also well supported by our numerical and real-world experiments. Our findings provide theoretical insights into the benefits of deeper architectures and contribute to enhancing the understanding of transformer architectures.
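The hijacking setup described above can be illustrated with a toy sketch. It assumes a single gradient-descent step on squared loss as a stand-in for a shallow linear transformer; that estimator, the two-dimensional data, and all names (`one_step_gd`, `predict`, `W_STAR`) are illustrative assumptions, not the paper's construction.

```python
# Toy illustration of context hijacking in in-context linear classification.
# Assumption: one gradient step from zero on squared loss, w_hat = mean(y_i * x_i),
# stands in for a shallow linear transformer. All values are made up.

W_STAR = (1.0, 0.0)  # hypothetical ground-truth classifier: label = sign(w* . x)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def true_label(x):
    return 1 if dot(W_STAR, x) > 0 else -1


def one_step_gd(context):
    """Estimate w from (x, y) pairs with a single averaged gradient step."""
    n = len(context)
    dim = len(context[0][0])
    return tuple(sum(y * x[j] for x, y in context) / n for j in range(dim))


def predict(context, query):
    return 1 if dot(one_step_gd(context), query) > 0 else -1


query = (0.1, 1.0)  # true label is +1, but the margin is small
clean = [((1.0, 0.0), 1), ((-1.0, 0.0), -1)]

# Hijacking token: close to the query, yet its correct label is -1.
hijack_x = (-0.1, 1.0)
hijack = [(hijack_x, true_label(hijack_x))]

print(predict(clean, query))               # prediction from the clean context
print(predict(clean + hijack * 2, query))  # prediction after two hijacking tokens
```

In this toy case the factually correct hijacking tokens drag the one-step estimate away from the true direction and flip the prediction on the query. A deeper model would correspond to more refinement steps, which this sketch does not model.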
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring
As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage. However, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores model reasoning and actions, and combines them using a weighted average. Our hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates twice as high as those of action-only monitoring for subtle deception scenarios.
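The hybrid protocol's combination step can be sketched as follows. This is a minimal sketch assuming both monitors emit suspicion scores in [0, 1]; the function names, the default weight, and the threshold are hypothetical and not taken from the paper.

```python
# Minimal sketch of combining two independent monitor scores by weighted average.
# Assumptions: scores lie in [0, 1]; weight and threshold values are illustrative.

def hybrid_suspicion(cot_score: float, action_score: float,
                     cot_weight: float = 0.5) -> float:
    """Weighted average of the reasoning-trace score and the action score."""
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must lie in [0, 1]")
    return cot_weight * cot_score + (1.0 - cot_weight) * action_score


def is_flagged(cot_score: float, action_score: float,
               cot_weight: float = 0.5, threshold: float = 0.5) -> bool:
    """Flag a transcript when the combined suspicion reaches the threshold."""
    return hybrid_suspicion(cot_score, action_score, cot_weight) >= threshold


# A persuasive rationalization lowers the CoT score, but the action score
# still contributes to the combined value.
print(hybrid_suspicion(cot_score=0.2, action_score=0.9, cot_weight=0.4))
print(is_flagged(cot_score=0.2, action_score=0.9, cot_weight=0.4))
```

Scoring the two channels separately means a misleading reasoning trace cannot by itself cancel out evidence visible in the final action.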
Safety Pretraining: Toward the Next Generation of Safe AI
As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: we build a safety classifier that sorts web data into safe and unsafe categories; (ii) Safety Rephrasing: we recontextualize unsafe web data into safer narratives; (iii) Native Refusal: we synthetically generate pretraining datasets that actively teach models to refuse unsafe content and convey the moral reasoning behind the refusal; and (iv) Harmfulness-Tag annotated pretraining: we flag unsafe content during pretraining using a special token, and use it to steer models away from unsafe generations at inference time. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no performance degradation on general tasks.
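The annotation half of step (iv) can be sketched as below. This is a rough sketch assuming a safety classifier has already produced an unsafe-probability score for each document; the tag string, the threshold, and the function name are hypothetical, and the inference-time steering mechanism is not shown because the abstract does not specify it.

```python
# Rough sketch of harmfulness-tag annotation for pretraining text.
# Assumptions: `unsafe_score` comes from some upstream safety classifier;
# the tag string and the 0.5 threshold are illustrative placeholders.

HARM_TAG = "<|harmful|>"  # hypothetical special token


def tag_document(text: str, unsafe_score: float, threshold: float = 0.5) -> str:
    """Prefix documents scored as unsafe with the special tag; leave others as-is."""
    if unsafe_score >= threshold:
        return f"{HARM_TAG} {text}"
    return text


corpus = [
    ("A recipe for sourdough bread.", 0.02),
    ("Placeholder for a passage the classifier scored as unsafe.", 0.93),
]
for text, score in corpus:
    print(tag_document(text, score))
```

Because the flagged text carries an explicit marker, the model can associate the token with unsafe content during pretraining.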
Jürgen Habermas Defended Reason in a Darkening Age
The great German philosopher, who died in March, understood how much depended on a principled public sphere. Habermas emerged from the uncompromising Frankfurt School, but his work was considerably less fatalistic. You wake up and brace yourself for the barrage of toxic gibberish that constitutes the modern public sphere. Your e-mail is overrun with spam, scams, and smut. There are voice mails from no one about nothing. A glance at the news reveals that the President is continuing to spew lies and obscenities; that a trillionaire is peddling white-supremacist propaganda on a social-media platform he owns; that a chart-topping musical artist is praising Hitler, or apologizing for praising Hitler, or praising Hitler once again. Publications from the on down employ clickbait headlines that treat you like a starving rat in a Pavlovian experiment. A.I. systems simulate the experience of talking to an arrogant ten-year-old boy who knows far less than he thinks he does. When pressed, the chatbots admit that they cannot "naturally understand human morality, dignity, culture, or meaning." It all adds up to a continuous discursive tinnitus--a buzz of random, fake, stupid, sinister chatter that nobody wants and nobody can stop. The person who should have been best able to explain how we got here was the great German philosopher Jürgen Habermas, who illuminated how a feisty, principled public sphere is integral to democracy. But Habermas died in March, at the age of ninety-six, and, although he remained active until his final months, commenting on Ukraine, Gaza, and Eurobonds, he struggled to understand the turn history had taken. As a teen-ager in 1945, he had witnessed American soldiers enter his home town of Gummersbach, near Cologne, carrying messages of freedom and openness.
Eight decades later, he watched American voters choose a leader who had advertised his fascistic bent in blood-and-soil rhetoric, fantasies of punitive violence, and a taste for bombastic architectural kitsch.
BAM-ICL: Causal Hijacking In-Context Learning with Budgeted Adversarial Manipulation
Recent research shows that large language models (LLMs) are vulnerable to hijacking attacks in in-context learning (ICL), where LLMs demonstrate impressive capabilities in performing tasks by conditioning on a sequence of in-context examples (ICEs), i.e., prompts with task-specific input-output pairs. Adversaries can manipulate the provided ICEs to steer the model toward attacker-specified outputs, effectively "hijacking" the model's decision-making process. Unlike traditional adversarial attacks targeting single inputs, hijacking attacks in LLMs aim to subtly manipulate the initial few examples to influence the model's behavior across a range of subsequent inputs, which requires distributed and stealthy perturbations. However, existing approaches overlook how to effectively allocate the perturbation budget across ICEs. We argue that fixed budgets miss the potential of dynamic reallocation to improve attack success while maintaining high stealthiness and text quality.
Russia to task bankers with shooting down Ukrainian drones
Russian lawmakers have passed a bill to allow trained bank employees to shoot down Ukrainian drones amid an increase in the number of attacks. The draft legislation, which would see banks across Russia install electronic jamming systems while selected employees would shoot down incoming unmanned aircraft, passed in its third and final reading in the lower house Duma on Tuesday, according to the state-run TASS news agency. The bill says the legislation is needed to protect Bank of Russia facilities, including those located in the new constituent entities of the Russian Federation - referring to the four eastern Ukrainian regions that Moscow has announced it has annexed despite not controlling them fully - amid the increasing number of sabotage and terrorist attacks. Under the plan, banks would finance the installation of the equipment on their premises. With banks in almost every town, their incorporation into Russia's air defences could help expand its cover.
Israel strikes Tyre after ordering evacuation of south Lebanon city
The Israeli military has said it is carrying out air strikes on Hezbollah targets in Tyre in southern Lebanon, after ordering the evacuation of the entire city. The military told residents that it was compelled to act forcefully in Tyre because the Iran-backed armed group was violating a US-brokered ceasefire that began five weeks ago. Earlier on Wednesday, Lebanese media reported a wave of Israeli strikes across the south and the eastern Bekaa Valley, with four people killed in the towns of Choukine and Nabatieh. Hezbollah, which has itself accused Israel of breaching the ceasefire, said it was battling Israeli troops north of the Litani river, about 30km (19 miles) from the border. It came a day after Israel's prime minister announced an expansion of its ground operation following Hezbollah drone attacks on troops occupying part of southern Lebanon and on civilians in northern Israel.
Eleven killed in Lebanon village as Israel intensifies strikes
Israel has launched an intensive wave of strikes across swathes of southern and eastern Lebanon, after vowing to step up its military action against Hezbollah. The Israeli military said it hit more than 100 Hezbollah infrastructure sites and fighters during what was one of the heaviest nights of bombardment since a US-brokered ceasefire began in mid-April. Strikes in the Bekaa Valley village of Mashghara killed 11 people, including two children, Lebanon's health ministry said. The military said it hit sites where terrorist activity was identified. It came after Israel's Prime Minister Benjamin Netanyahu said he had given the instruction to press the pedal even harder in targeting Hezbollah.